Executive Summary and Overview:

To the Line Producer of Company X,

Is there a model that can help predict the future box office success of a film? How can production companies better predict which will be box office hits and which will be box office flops? Take for example, Andrew Stanton, coming off of the success of Finding Nemo (2004) and Wall E (2008) was lined up to direct another box office hit, John Carter (2012) catered towards a similar audience but did not receive the same box office warmth as his former two outings.

The following report aims to show what seem to be relevant factors that affect box office sales of a particular movie. We will be using a variety of modeling techniques (Random Forest, Decision Tree, Linear Regression) to aid us in predicting how much a a film will make given the its budget, run time, genre, crew size, year released, and week day released among a host of other variables to consider.

A word of caution before we proceed with the development, fine tuning, and interpretation of our models. The models we are about to create will only be as good as the data that is fed into them. Throughout the Exploratory Data Analysis, I will be sure to highlight where the data used to create the model may not be reflective of the true distribution of films released worldwide which made lead our model to overweight particular films.

It is my hope that this report will give you a better understanding of some underlying economic forces that determine a film’s box office and help you plan accordingly for your film’s budget. Some limitations of this report include the fact that it does not participate in two deeper layers of analysis:

1.) The following models, in the author’s opinion lack the nuance in identifying the idea that the creative process and artistic merit a film doesn’t always equate to a financial success. This model lacked a variety of metrics for success, mainly using revenue and popularity as a response variables and explanatory variables, respectively. If this analysis and modeling were to be redone with more extensive data, I would aim to include other metrics of success such as rotten tomato scores, and number of award nominations.

2.) A further break down of demographics in the revenue a film generates. The data on film revenue is an aggregate across several demographics and is impossible to determine a complete breakdown by demographics. This layer of analysis would be pertinent to a line producer’s job to determine the viability a film project given your specific demographics. For example, a line producer drafting a budget for an independent film in urban New York City will most likely have a different demographic to keep in mind than a blockbuster film in a rural part of Mexico.. This starts to enter the realm of marketing and segmenting for specific customers.

In other words, the author of this report acknowledges that what the model proposes to be a good formula for a box office hit may not lead to perfectly ethical casting or creative decisions.

But enough with the the opening credits, as they say in the industry: “Lights, camera, ….action!”

Libraries

First, let’s load the libraries we will be using:

library(tidyverse)
library(tidymodels)
library(lubridate)
library(robotstxt)
library(data.table)
library(gridExtra)
library(glue)
library(rpart.plot)
library(reactable)
library(naniar)
library(visdat)
library(wesanderson)
library(pander)

Data Collection

Loading CSV Files

The loading and cleaning of the data largely mirrors the process outlined by Saba Tavoosi

#Loading the Data
films_train <- read.csv("data/train.csv", na.strings=c("", '#N/A', '[]', '0'))
films_test <- read.csv("data/test.csv", na.strings=c("", '#N/A', '[]', '0'))

Cleaning CSV Files

# Cleaning function
clean_data <- function(dataset){
  data <- dataset %>% 
  select(-`poster_path`, - `tagline`, - `overview`, `homepage`) %>% #removing variables that won't be considered for further analysis
  mutate(release_date = parse_date_time2(release_date, "mdy", cutoff_2000 = 20), #ensuring years after 2000 are correctly identified in the right century.
         release_year = lubridate::year(ymd(release_date)), #grab year number from release_date
         release_month = lubridate::month(ymd(release_date)), #grab month name from release_date
         release_month_abbr = factor(month.abb[release_month],levels=month.abb),
         release_quarter = lubridate::quarter(ymd(release_date)), #grab quarter number from release_date
         release_week = lubridate::week(ymd(release_date)), #grab week number from release_date
         release_wday = lubridate::wday(ymd(release_date)), #grab weekday number from release_date
         release_wday_name = weekdays.Date(release_date, abbreviate = TRUE), #grab weekday abbreviated name from release_date
         original_language = as.factor(original_language), 
         is_english = case_when(original_language == "en" ~ "English",
                              original_language != "en" ~ "Non English"), #new variable of two levels identifying whether it is an english or non-english speaking film.
         genres = as.factor(genres),
         main_genre = str_extract(genres, "Comedy|Horror|Action|Drama|Documentary|Science Fiction|
              Crime|Fantasy|Thriller|Animation|Adventure|Mystery|War|Romance|Music|
              Family|Western|History|TV Movie|Foreign"), #identifies the main genre of the movie according choosing from the given list
         status = as.factor(status),
         series = str_extract(belongs_to_collection, "(?<=name\\'\\:\\s{1}\\').+(?=\\'\\,\\s{1}\\'poster)"),
         series = ifelse(!is.na(series), as.character(series), "No collection"), # filling NA values with "No Collection"
         production_companies = gsub('(^\\[\\{\'name\'\\:\\s\'|\'\\,\\s\'id.*)', '',
                                 production_companies),
         production_countries = str_extract(production_countries, "[:upper:]+"),
         top_prod_comp = case_when(production_companies == 'Universal Pictures' ~ 'Universal Pictures',
                                   production_companies == 'Paramount Pictures' ~ 'Paramount Pictures',
                                   production_companies == 'Twentieth Century Fox Film Corporation' ~ 'Twentieth Century Fox Film Corporation',
                                   production_companies == 'Columbia Pictures' ~ 'Columbia Pictures',
                                   production_companies == 'New Line Cinema' ~ 'New Line Cinema',
                                   production_companies == 'Warner Bros.' ~ 'Warner Bros.',
                                   production_companies == 'Walt Disney Pictures' ~ 'Walt Disney Pictures'),
         top_prod_comp = ifelse(!is.na(top_prod_comp), as.character(top_prod_comp), "Other"), #filling NA values with "Other"
         is_top_prod_comp = ifelse(top_prod_comp == "Other", "No", "Yes"),
         part_of_franchise = ifelse(!is.na(series), "Yes", "No"),
         all_cast_size = str_count(cast, "name"),
         female_cast_size = str_count(cast, ('gender\'\\:\\s1')),
         male_cast_size = str_count(cast, ('gender\'\\:\\s2')),
         all_crew_size = str_count(crew, 'name'),
         female_crew_size = str_count(crew, ('gender\'\\:\\s1')),
         male_crew_size = str_count(crew, ('gender\'\\:\\s2')),
         genre_count = str_count(genres, 'name'),
         filtered_na = case_when(is.na(budget) ~ "excluded",
                                 is.na(runtime) ~"excluded",
                                 is.na(all_cast_size) ~ "excluded",
                                 is.na(female_cast_size) ~ "excluded",
                                 is.na(genre_count) ~ "excluded",
                                 is.na(production_companies) ~ "excluded"),
         filtered_na = ifelse(!is.na(filtered_na), as.character(filtered_na), "included"),
         budget = budget/1000000
  )
  first_in_series <- data %>% 
  filter(!is.na(series)) %>% 
  arrange(series,
          release_date) %>% 
  group_by(series) %>% 
  slice(1) %>% 
  mutate(order = "before") %>% 
  select(imdb_id,
         order)

data <- data %>% 
  left_join(first_in_series, by = c("imdb_id" = "imdb_id")) %>%
  rename(series = series.x) %>% 
  mutate(order = ifelse(is.na(order), "after", as.character(order))) %>% 
  select(-c(`id`, `belongs_to_collection`, `homepage`, `imdb_id`, `status`, `title`, `Keywords`, series.y, )) #removing columns impertinent to future modeling.

}

#Cleaning the Data
films_train <- clean_data(films_train) %>% 
  mutate(revenue = revenue/1000000)
films_test <- clean_data(films_test)
films_all <- bind_rows(films_train, films_test)

A note about the titles of the data set: The original “test” data set had NA values for revenue as that column was left for modeling to fill in values. However, in order to be able to assess the accuracy of the training on the testing data set (hence, supervised learning), I needed a testing data set that had the “right answer” provided in order to see how far off my models were (Using metrics such as Mean Absolute Error and Mean Absolute Percentage Error). Hence, I decided to split the training data set into a training data set and testing data set.

Critique of Prior Work

An overview of their work

Technical Choices:

Saba Tavoosi did well in extracting information from JSON data but I would recommend on providing more text analysis explaining her reasoning behind the regular expressions used. Some were convoluted and the rest her works depends on a level of trust that regular expressions are extracting the information she wanted.

Saba Tavoosi painstakingly and with accuracy populated her data set with extra research and manually entering missing data for budget. This communicates a high priority in creating a complete training set rather than filtering out NA values for budget. Tavoosi could alternatively join data sets together on the uniuqe key variable imdb_id using API’s to accumulate more data with their corresponding budget and box office information. A pro of Tavoosi’s method is that it allows her to research individual films and populate as many of her own variables (all_cast_size, runtime) as she can . A con of Tavoosi’s method is the time inefficiency in doing this process for several more cases.

Another technical decision by Saba Tavoosi was her inclusion of plots using only numeric or integer variables. I intend to do some further data wrangling to obtain the week day names and month names to include more descriptive labels for plots. Part the data wrangling included mostly numbers for week days rather than the week day name itself leading to some confusion what day number 4 means.

Saba Tavoosi focused more on exploratory data analysis (EDA) which was thoroughly done well, much of which will be replicated in this report. Tavoosi included only one model near the end on supervised data but did not provide error metrics for the testing data set. In this report, I hope to expand on this by splitting the training data set further in order to be able to see how the model works on untested dataset. The last step Tavoosi had in her report was to populate the testing data set with predictions but without the actual revenue for each film in the testing data set, it is impossible to determine the accuracy of the model on untested data. I hope to extend from Tavoosi’s modeling work by provding three or four more models such as a decision tree and a linear regression. I will also explore taking out variables such as popularity.

Trustworthiness

As mentioned before, a lot of trust goes behind the regular expressions to extract any meaningful analysis when using the variables with json data. Tavoosi could have done more to gain trust in her regular expressions by explainig the process and what information she was hoping to extract. The extent of her work includes regular expressions but quickly moves on to plotting without confirmation that the numbers she obtained were indeed, the ones she was expecting.

Data Cleansing

The data cleansing above consisted of removing three columns that were not of immediate pertinence to this discussion. I ensured the release date would be read as a date with the correct corresponding years. I also created columns to describe what quarter of the year a film was released, used regular expressions to remove gratuitous text in the belongs_to_collection and production_companies variables, and labeled which films were released by top production companies which include Universal Pictures, Paramount Pictures, Twentieth Century Fox Film Corporation, Columbia Pictures, New Line Cinema, Warner Brothers, and Walt Disney Pictures. I loaded the test data set and training data set separately as that is the form in which I found them and joined them together to create the data set films_all. Note, the test data set as expected did not have a revenue column as that is the response variable in question so films_all will have as many cases where revenue is not available as there are number of cases in the films_test.

The following plot should illustrate this fact:

films_all %>% 
  select(revenue) %>% 
  vis_miss()

It should be noted that the variables belongs_to_collection, genres, production_company, spoken_languages, Keywords, cast, and crew were imported as json data. Regular expressions, copied from Saba Tavoosi have been used to extract information from each column.

I have decided to remove gratuitous variables that the author of this report has deemed to have little use for future modeling. For example, imdb_id is an identifier column and gives little information about the content of the film. What the ID number may communicate is the age of the film (lower ID numbers may be have been released earlier as more recent films receive higher ID numbers) but the age of the film can be better determined by its release_year.

Data Engineering

I have engineered the following new variables using information from the given columns from the Kaggle data set:

is_top_prod_comp : describes if a film was produced by any of the top six production companies (Universal Pictures, Paramount Pictures, Twentieth Century Fox, Columbia Pictures, New Line Cinema, Warner Brothers, and Walt Disney Pictures.) labelled as “Yes”. If not, is is labeled as “No”. Note the distinction between Twentieth Century Fox and Walt Disney Pictures is permissible as this data set contains films released before Disney’s acquisition of Twentieth Century Fox.

part_of_franchise : describes if a film is part of a collection, franchise, or has a shared characters existing in other films, labelled as “Yes”.If not, labeled as “No”.

order: describes if a film within this data set has another film represented that comes after it in the same collection. The variable orderhas two levels: “before” and “after”. Note, since this data set is not a complete list of films and will frequently omit films that started franchises, I have opted to use the vocabulary “before” and “after” to identity within this data set which films have come before other films in the same franchise. The words before and after should not be used synchronously with original and sequel, respectively, as it is quite possible to have a film in this data set be considered the ‘first’ in its collection but still be a sequel. Also note that due to the filtering join function performed to obtain a list of films that are the “first” in their collection within this data set, the data set will categorize all others as “after” is the only film in the data set. The author and modeler of this report acknowledges the imprecision in this data engineering and encourages the reader to consider this variable to be of little importance in future models.

Data Exploration

NA Values

Before we filter out values in preparation for modeling, let’s examine which cases we will be removing and if they resemble a pattern. Ideally for the models to be unaffected by the absence of certain cases, we would like to see relative similarity between the density plots of revenue for cases that will be included and cases that we are planning to exclude.

vis_dat(films_train)

vis_miss(films_train)

gg_miss_var(films_train)

wes_colors4 <- wes_palette("GrandBudapest1", 4)

films_train %>% 
  ggplot() +
  aes(x = revenue,
      fill = filtered_na)+
  geom_density() +
  scale_x_log10() + #proportions larger revenue values to appear larger.
  scale_fill_manual(values = wes_colors4)

It appears that we have the most missing values from our variable budget with 812 missing values or NA values.

Our density plot clearly shows that removal of our NA values, particularly those in budget requires us to alter our interpretation of the model. The fact that the peak of the “excluded” density falls lower in height (density) and in value (revenue) suggests that many of the budgets that were left as NA in the original data set had very low budgets. Our models then adjust to describe films that only reside in the blue shaded area of films in the “included” portions where the revenue is high. In other words, this model would be more appropriately used for films expected to garner more at the box office than independent films.

We then proceed to filter out the cases in the red shaded areas to be excluded:

films_train <- films_train %>% 
  filter(!is.na(budget), #filtering in preparation for modeling
         !is.na(runtime), #filtering in preparation for modeling
         !is.na(all_cast_size), #filtering in preparation for modeling
         !is.na(female_cast_size), #filtering in preparation for modeling
         !is.na(genre_count), #filtering in preparation for modeling
         !is.na(production_companies)) #filtering in preparation for modeling

Splitting Test and Train Data Sets

For the purposes of supervised modeling, I have decided to use only the testing data and split it further into training and testing. In this way, I will be able to evaluate the accuracy of my model in a supervised manner and calculate error metrics for my test data.

The top 5 cases for the training set are listed below:

set.seed(1) 
film_split <- initial_split(films_train, prop = 9/10)
train <- training(film_split)
test <- testing(film_split) 
  # filter(!is.na(genre_count))

set.seed(616)
film_resamples <- vfold_cv(train, v = 5)


train %>% 
  head(5) %>% 
  select(original_title,
         budget,
         popularity,
         production_companies,
         release_date,
         runtime,
         revenue) %>% 
  reactable(filterable = TRUE,
            searchable = TRUE,
            minRows = 5)

Exploring the Distribution Across the Calendar Year by Month and Quarter

The following code is taken and reworked from Saba Tavoosi

# Year released
year_plot <- films_train %>% 
  ggplot(aes(x = release_year,
             y = revenue,
             color = release_year)) +
   geom_point() +
   geom_smooth(method = 'lm', color = 'red3', fill = 'red3') + 
   scale_y_continuous(breaks = c(0, 500000000, 1000000000, 1500000000),
                      labels = c('$0', '$500', '$1000', '$1500')) +
   theme_classic() +
   theme(legend.position = 'none') +
   labs(title = 'Revenue by year released', x = 'Release year', y = 'Revenue (Millions)')

# Quarter released
quarter_plot <- films_train %>% 
  ggplot(aes(x = factor(release_quarter),
             y = revenue,
      fill = factor(release_quarter))) +
  stat_summary_bin(fun = median, geom = "bar") +
  scale_y_continuous(breaks = c(0, 10000000, 20000000),
                      labels = c('$0', '$10', '$20')) +
   theme_classic() +
   theme(legend.position = 'none', axis.text.x = element_text(angle = 90)) +
  scale_fill_manual(values = wes_colors4) +
   labs(title = 'Revenue by quarter released', 
        x = 'Release quarter', 
        y = 'Median revenue (Millions)')
  
# Month released

month_plot <- films_train %>% 
  ggplot(aes(x = factor(release_month_abbr),
             y = revenue,
             fill = release_month)) +
  stat_summary_bin(fun = median, geom = "bar") +
   scale_y_continuous(breaks = c(0, 10000000, 20000000, 30000000),
                      labels = c('$0', '$10', '$20', '$30')) +
   theme_light() +
   theme(legend.position = 'none', axis.text.x = element_text(angle = 45)) +
   labs(title='Median revenue by month released', x='Release month', y='Median revenue (Millions)')



# Week released
week_plot <- films_train %>% 
  ggplot(aes(x = factor(release_week),
             y = revenue, 
             fill = factor(release_week))) +
  stat_summary_bin(fun = median, geom = "bar") +
   scale_y_continuous(breaks = c(0, 20000000, 40000000, 60000000),
                      labels = c('$0', '$20', '$40', '$60')) +
   theme_light() +
   theme(legend.position = 'none', axis.text.x = element_text(angle = 90)) +
   labs(title='Revenue by week released', x='Release week', y='Median revenue (Millions)')

# Weekday released
weekday_plot <- films_train %>% 
  ggplot(aes(x = reorder(factor(release_wday_name), release_wday), 
             y = revenue,
             fill = release_wday_name)) +
   stat_summary_bin(fun = median, geom = "bar") + 
   scale_y_continuous(breaks = c(0, 10000000, 20000000, 30000000),
                      labels = c('$0', '$10', '$20', '$30')) +
   theme_light() +
   theme(legend.position = 'none', axis.text.x = element_text(angle = 45)) +
   labs(title = 'Revenue by weekday released', x='Release day', y='Median revenue (Millions)')

# Create a grid of the plots.
grid.arrange(year_plot, quarter_plot, month_plot, weekday_plot, week_plot,
             layout_matrix = rbind(c(1, 2, 3),
                                   c(5, 5, 4)))

Distribution of Films across Years:

The following code is taken from Saba Tavoosi

columnchart_by_year <- function(dataset, color, subtitle){
  dataset %>%
  group_by(release_year) %>% 
  count() %>% 
  ggplot() + 
  aes(x = release_year,
      y = n,
      fill = release_year) +
  geom_col(fill = color) +
  theme_minimal() +
  theme(legend.position = "none") +
  labs(title = "Distribution of Released Films by Release Year",
       subtitle = subtitle,
       x = "Year", 
       y = "Number of Films",
       caption = "Source: kaggle")
}

Films Distribution by Release Year

films_train %>%
  group_by(release_year) %>% 
  count() %>% 
  arrange(desc(n)) %>%
  head(5) %>% 
  pander()
release_year n
2010 95
2013 93
2016 93
2011 92
2015 85
films_test %>%
  group_by(release_year) %>% 
  count() %>% 
  arrange(desc(n)) %>% 
  head(5) %>% 
  pander()
release_year n
2014 197
2013 194
2011 187
2015 184
2010 180
year_column_chart_train <- columnchart_by_year(train, "lightcyan3", "from training dataset'")
year_column_chart_test <- columnchart_by_year(test, "lightcyan4", "from testing dataset'")



grid.arrange(year_column_chart_train,
             year_column_chart_test,
             nrow = 2)

The most popular years for a film in our data set to be released were 2013, 2014, and 2015 with 335, 320, and 312 films respectively. Our data set only had 1 film coming from 1923, 1924, 1929, 2018, 2021. We have a skewed left distribution of films across the years with the earliest coming from 1923 and the most recent from 2018.

It appears that our testing and training data set both follow a skewed left distribution with a peak near the 2010 and 2011 years. A conspicuous difference that will potentially affect the difference in accuracy our model will predict on the testing and testing data sets will be the noticeable dip in the number of films released in 2009 within the testing data set.

I have reservations about how this training data set mirrors the distribution of films in the real world. It appears that this data set has an oversampling bias for the early 2000’s. For example, Allen J Scott has found a more even distribution of films than the spike depicted in our data set. This data set also does not feature many releases from 2021 and 2022, thereby creating a distribution that poorly resembles the findings of Statista For example, I find it concerning that this data set doesn’t contain films from the most recent years, especially as these films may contain information related to the pandemic’s tumultuous effect on the film industry. In other words, the absence of these films from more recent years may potentially underestimate the effect the pandemic has had on future films. If the data set does contain films in more recent years, I would intend to engineer a new variable that classifies film as “pre-pandemic”, “immediately post-pandemic films” (i.e. films released one year of March 10, 2020) and “late post-pandemic films” (i.e. films released after March 10, 2022.).

Exploring patterns and profitability with Budget:

The following code is taken from Saba Tavoosi

# Budget
films_train %>% 
  ggplot(aes(x = budget, 
             y = revenue, 
             color = budget)) +
  geom_point() +
  geom_smooth(method = 'gam',
              color = "black", 
              fill = 'gray') +
  # scale_x_log10()
  scale_y_continuous(breaks = c(0, 500000000, 1000000000, 1500000000),
                      labels = c('$0', '$500', '$1000', '$1500')) +
  scale_x_continuous(breaks = c(0, 50000000, 100000000, 200000000, 300000000), 
                      labels = c('$0', '$50', '$100', '$200', '$300')) +
   theme_classic() +
   theme(legend.position = 'none') +
   labs(title = 'Revenue by budget', x = 'Budget (Millions)', y = 'Revenue (Millions)')

While not too many observations can be drawn from this preliminary graph, one can see how in general, a higher budget film may lead to higher revenue. This is of course a trend with many opportunities to show a high budget film with a low revenue and vice versa.

Counts and Box Office Sales By Genre:

The following code is taken from Saba Tavoosi

# Movie counts by main genre
films_train %>% 
  ggplot() +
  aes(x = fct_infreq(main_genre),
      fill = main_genre) +
  geom_bar() +
  theme_bw() +
  coord_flip() +
  theme(legend.position = 'none') + 
  labs(title = 'Genre by count', x = 'Genre', y = 'count')

#Boxplot per Genre

films_train %>% 
  ggplot(aes(x = fct_infreq(main_genre),
             y = revenue,
             fill = main_genre)) +
  stat_summary() +
  coord_flip() +      
  scale_y_continuous(breaks = c(0, 50000000, 100000000, 200000000),
                      labels = c('$0', '$50', '$100', '$200')) +
  coord_flip() +
  theme_classic() +
  theme(legend.position = 'none') +  
  labs(title = 'Median revenue by genre', x = 'Genre', y = 'Median revenue (Millions)')

The bar plot above shows that Drama, Comedy, and Action were the most popular film releases in our data set. On the other end of the spectrum, Foreign films, History, and Western were the last popular. It should be noted that the classifications of genre may, out of simplification, choose drama and comedy as they are typically the dominant genre to describe a film. For example, a murder comedy mystery may be classified as a comedy. A further analysis could break these genres into further subsets and capture the nuances of more specified genres and how it affect revenue.

The second plot shows a five number summary (Min, Q1, Median, Q3, Max) breakdown by Genre. We can observe that Science Fiction has a wide distribution whereas as Adventure, with a smaller distance between their minimum and maximum values, generally performs the best out of all the other genres. On the other end, we see that Documentary films and foreign films on average, tend to have the lowest box office sales when compared to other genres.

Comparing Box Office Sales between the top production companies

The following code is taken from Saba Tavoosi

#Place this in the clean dataset
films_train %>% 
  filter(!is.na(top_prod_comp)) %>% 
  ggplot() +
  aes(x = top_prod_comp,
      y = revenue, 
      fill = top_prod_comp) +
  geom_boxplot() +
  scale_y_continuous(breaks = c(0, 500000000, 1000000000, 1500000000),
                      labels = c('$0M', '$500', '$1000', '$1500')) +
   coord_flip() +
   theme_classic() + 
   theme(legend.position = 'none') +
   labs(title = 'Revenue by top production companies',
        x = 'Top production companies', y = 'Revenue (Millions)')

When broken down by Production Companies, we see that Walt Disney seems to outperform its competitors with the highest median and Q3. While Universal Pictures and Paramount Pictures may have a smattering of films that have higher box office sales than Walt Disney highest grossing film, the body of these two company s’ inner quartile range is solidly lower than Disney median.

However, the author of this report suggests the reader to apply the results of this boxplot only to the data set available. For example, it is evident that Walt Disney has produced a number of top grossing films but are not represented in some of the largest outliers. For example, from this plot, I was able to confirm that this box plot does not include Avatar or Avengers:Endgame the highest and second highest grossing films of all time, respectively.

Revenue by Year

The following code is taken from Saba Tavoosi

films_train %>%
  group_by(release_year) %>% 
  summarize(average_boxoffice = mean(revenue)) %>% 
  ggplot() + 
  aes(x = release_year,
      y = average_boxoffice) +
  geom_col() +
  theme_minimal() +
  theme(legend.position = "none") +
  labs(title = "Average Box Office Sales ",
       subtitle = "by Release Year",
       x = "Year", 
       y = "Average Box Office Sales",
       caption = "Source: kaggle, trained data")

As predicted, with a combination of more volume increase and inflation, the overall average box office sales per year has increased since 1925 according to our data. The revenue values do not account for inflation.

Sequel or Not:

# Movie Count by Sequel Stats
films_train %>% 
  ggplot() +
  aes(x = fct_infreq(order),
      fill = order) +
  geom_bar() +
  theme_bw() +
  coord_flip() +
  theme(legend.position = 'none') + 
  labs(title = 'Order by count', x = 'Order', y = 'Count') +
  scale_fill_manual(values = wes_colors4)

# Revenue against Order of film in franchise

films_train %>% 
  ggplot(aes(x = order,
             y = revenue,
             fill = order)) +
  geom_boxplot() +
  scale_fill_manual(values = wes_colors4) + 
  theme(legend.position = "none")

Our results indicate that our dataset contained more films classified as “after” but films classified as “before” overall performed better at the box office as depicted in the boxplot.

Initial Modeling

Splitting Test and Train Data Sets

See above after the filtering of the data set to see the training and testing split.

Setup

I aim to use the variable revenue as the exploratory variable with budget, popularity, production_companies, production_countries, release_date, runtime, release_year, release_month, release_quarter, release_week, release_wday, all_cast_size, female_cast_size, male_cast_size, and genre_count as explanatory variables.

Initial Model: Linear Regression Cross Validation

Both of these two models will use the same formula.

Linear Regression using Cross Validation

film_formula1 <- revenue ~ budget + popularity + production_companies + release_date + runtime + release_year + release_month + release_quarter + release_week + release_wday + is_english + is_top_prod_comp

formula1 <- revenue ~ budget + popularity + release_date + runtime + release_year + release_month + release_quarter + release_week + release_wday + all_cast_size + female_cast_size + male_cast_size + genre_count

linreg_recipe1 <- recipe(
  film_formula1,
  data = train) %>% 
  step_other(production_companies)

linreg_workflow3 <- workflow(
  preprocessor = linreg_recipe1,
  spec = linear_reg()
)


#Using Cross Validation
model1_samples <- fit_resamples(linreg_workflow3,
                                resamples = film_resamples,
                                metrics = metric_set(mae))


model1_samples %>%
  collect_metrics(summarize = TRUE)
## # A tibble: 1 × 6
##   .metric .estimator  mean     n std_err .config             
##   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1 mae     standard    56.6     5    2.04 Preprocessor1_Model1

Linear Regression, Decision Tree, Random Forest

I proceed to create three new models: A linear regression, decision tree, and a second random forest model. These three models will use the same formula with the same explanatory variables. However, this formula includes all_cast_size,female_cast_size, male_cast_size, and genre_count while eliminating production_companies.

Linear Regression

formula1 <- revenue ~ budget + popularity + release_date + runtime + release_year + release_month + release_quarter + release_week + release_wday + all_cast_size + female_cast_size + male_cast_size + genre_count

#creating a function to fit a model given a the model type.
fit_model <- function(type){
  model_formula <- formula1
  spec <- type(mode = "regression")
name <- fit(
  spec,
  model_formula,
  data = train) 
  
name
}

#creating a function to plot a model on a budget v. revenue scatterplot
model_scatter_plot <- function(model){
  augment(model, train) %>% 
  ggplot(aes(x = budget, y = .pred)) +
  geom_line() +
  geom_point(aes(y = revenue), alpha = .05) +
  scale_x_log10()
}

#Fitting Model
linear_model1 <- fit_model(linear_reg)

#Plotting Model
model_scatter_plot(linear_model1)

Decision Tree

dectree_recipe <- recipe(formula1,
                         data = train)
  
dectree_workflow <- workflow(
  preprocessor = dectree_recipe,
  spec = decision_tree(mode = "regression", 
                       tree_depth = 5,
                       cost_complexity = .000001, 
                       min_n = 2)
)

#Fitting Model
dt_model <- fit(dectree_workflow, data = train)

#Plotting Model
model_scatter_plot(dt_model)

#Decision Tree Plot
dt_model %>%
  extract_fit_engine() %>% 
  rpart.plot(roundint = FALSE, digits = 3, type = 5)

Interpretation of Decision Tree:

For context, Saba Tavoosi created a random forest model with an accompanying importance plot that communicated that budget, runtime, and popularity were the three most important variables in her model. This importance plot was derived from a correlation plot of all of her variables.

Our Decision Tree plot confirms the importance of budget, runtime and the popularity in predicting a film’s revenue. We find that our decision tree model first split on budget, with large majority of films having been produced with a budget under 103 million dollars. We notice an overall pattern where films on the right side of the decision tree plot that have been categorized as having a budget over 103 million dollars have predictor values in their terminal nodes that are higher than the terminal node values on the left hand side. For example, our particular decision tree model, dt_model, would predict that a film that has a budget of 98 million dollars, with a popularity score of 12, would make around 250 million dollars along with 2.1% of the films in the training data set. A producer looking to be in the exclusive club of passing the 1 billion dollar mark would pay close attention to routes to the terminal nodes with values greater than 1,000 (in millions). In this case, a our model would predict any film that had a budget over 290 million dollars, had a popularity score above or equal to 26.7, and had less than 3.5 genres. Overall, I believe this decision tree is making logical conclusions about the data set.

Random Forest Model

formula1 <- revenue ~ budget + popularity + release_date + runtime + release_year + release_month + release_quarter + release_week + release_wday + all_cast_size + female_cast_size + male_cast_size + genre_count


randforest_recipe <- recipe(
  formula1,
  data = train) %>% 
  step_normalize(all_numeric_predictors())
  
randforest_workflow <- workflow(
  preprocessor = randforest_recipe,
  spec = rand_forest(mode = "regression")
)


#Fitting Model
rf_model <- fit(randforest_workflow, data = train)

#Plotting Model
model_scatter_plot(linear_model1)

Accuracy Metrics

MAE and MAPE

Now evaluating the accuracy on the test dataset.

# MAE Error Metric Function given one of the fitted models
mae_error_metrics <- function(model){
  augment(model, test) %>% 
    mae(truth = revenue, estimate = .pred) %>% 
    pander()
}

# MAPE Error Metric Function given one of the fitted models
mape_error_metrics <- function(model){
  augment(model, test) %>% 
    mape(truth = revenue, estimate = .pred) %>% 
    pander()
}

#Error Metrics
mae_error_metrics(linear_model1)
.metric .estimator .estimate
mae standard 59.3
mape_error_metrics(linear_model1)
.metric .estimator .estimate
mape standard 102109
mae_error_metrics(dt_model)
.metric .estimator .estimate
mae standard 57.45
mape_error_metrics(dt_model)
.metric .estimator .estimate
mape standard 22013
mae_error_metrics(rf_model)
.metric .estimator .estimate
mae standard 50.89
mape_error_metrics(rf_model)
.metric .estimator .estimate
mape standard 13277

Accuracy Graphs

pred_vs_obs <- function(model, subtitle){
  augment(model, test) %>%
  ggplot(aes(x = revenue, 
             y = .pred, 
             color = is_top_prod_comp)) +
  geom_point(alpha = .85) +
  coord_obs_pred() +
  geom_abline() +
  labs(title = "Predicted v. Observed Scatterplot",
       subtitle = subtitle,
       x = "Observed",
       y = "Predicted",
       color = "Top Production Company") +
  scale_color_manual(values = wes_colors4)
} #function to create a predicted v. observed plot

pred_vs_obs(linear_model1, "linear regression")

pred_vs_obs(dt_model, "decision tree")

pred_vs_obs(rf_model, "random forest")

glue("We have obatined a mean aboslute error of {mae_error_metrics(linear_model1)[3]}, {mae_error_metrics(dt_model)[3]}, and {mae_error_metrics(rf_model)[3]} for our linear regression, decission tree regression, and random forest regreesion models, respectively.")
## We have obatined a mean aboslute error of NA, NA, and NA for our linear regression, decission tree regression, and random forest regreesion models, respectively.
glue("We have obatined a mean aboslute percentage error of {mape_error_metrics(linear_model1)[3]}, {mape_error_metrics(dt_model)[3]}, and {mape_error_metrics(rf_model)[3]} for our linear regression, decission tree regression, and random forest regreesion models, respectively.")
## We have obatined a mean aboslute percentage error of NA, NA, and NA for our linear regression, decission tree regression, and random forest regreesion models, respectively.

Interpretation of Error Metrics and Predicted v. Observed Plots:

Our findings show that our ‘best’ model with the lowest range of error (MAE) was the second random forest model, though even this model could be improved.

In our Predicted v. Observed plots, we can confirm that our second random forest model performed the best as many of the data points fell close to the diagonal black line. A perfect fit would have all data points lined up on the diagonal black line signifying that the predicted value is the same as the observed value. This was not apparent in the linear regression model and especially not in the decision tree model. In both the linear regression model and our second random forest model, we noticed that the model would start to underestimate the film’s revenue with larger budgets.

Conclusions:

Findings

Given the instances in the discrepancy between the films represented in the data set used to create our four models and the total number of films, I would advise the reader to limit their applications of these models on other films.

The results of the models, particularly the random forest and decision tree models indicate that a larger budget, in general, results in a larger box office return (revenue). Other important factors to consider are the popularity of films. This, however, is unhelpful to a line producer as this data point is recorded after the release of a film where the budget has already been finalized. One way to work with this data is to examine more closely the content of your upcoming film and how it may compare to your previous popular films within your . It is apparent and confirmed in these particular models that people are more willing to see more popular films so it is advisable to produce a film that mirrors successful models. However, this model fails to capture the value of originality which affects the popularity of a film (i.e. Everything Everywhere All At Once, an innovative film and largest grossing film for A24).

Another large factor to consider is that the variable popularity is retrospective and essentially allows the model to cheat by using data after a film has been released. The popularity variable, similar to the our response revenue variable is only known after the film’s release or after both critics and audiences have had a chance to see the film. In other words, in actual application, it would be impossible to include popularityin our model as it would be determined after the response variable itself. Potentially, one could use the rotten tomato score or the early IMDB rating from early viewers as a substitute variable to use in place of popularity. However, this risks including misleading information from review bombing where a film can still be successful despite a number of passionate film reviewers giving negative reviews early on before the film has even been released. What happens to our models if we remove the variable popularity from our model formulas? Let’s examine the best model we had, the random forest model. We use the same preprocessing and error metrics but only alter the formula to not include popularity

Adjusting for the removal of popularity

formula2 <- revenue ~ budget +  release_date + runtime + release_year + release_month + release_quarter + release_week + release_wday + all_cast_size + female_cast_size + male_cast_size + genre_count #removed popularity

randforest_recipe <- recipe(
  formula2,
  data = train) %>% 
  step_normalize(all_numeric_predictors())
  
randforest_workflow <- workflow(
  preprocessor = randforest_recipe,
  spec = rand_forest(mode = "regression")
)


#Fitting Model
rf_model2<- fit(randforest_workflow, data = train)

#Error Metrics

mae_error_metrics(rf_model)
.metric .estimator .estimate
mae standard 50.89
mae_error_metrics(rf_model2)
.metric .estimator .estimate
mae standard 54.71

As predicted, our model has worsened in its accuracy when we take out one of its key predictor variables. Intuitively, popularity could potentially have a correlation with revenue.

Strengths of the Trained Data Set and Models:

While I have been critical of the representation of the data, I commend the data set for having including detailed information about each film that would be able to add nuance to the model. In other words, while the data set may be lacking in quality of the rows, it compensates with the quality of the columns. However, there still exists a multitude of other factors that can be strong determinants a box office success.

Limitations of the Trained Data Set and Models:

As iterated in the executive summary, I hold reservations from solely using this model in your exploration. Consider other models that are able to process film data about differentiating creative inputs (i.e. Directors, use of color palette, cinematographers, level of profanity/nudity/thematic content) that will affect the quality of the film and its likelihood to be well received by your specific demographics for whom the film was marketed. Consider also the ethical implications that this model fails to address. Consider the legal ramifications if the casting decides show gender bias based on the evidence of this model.

Future Direction

Continually seek better data that can generate more insights into the film industry and what are the key drivers that drive up the willingness to pay. The author of this report encourages seeking a balance between creative integrity of film-making and the business end of the industry.

Citations

Saba Tavoosi: https://www.kaggle.com/code/tavoosi/predicting-box-office-revenue-with-random-forest